ABSTRACT



Method and apparatus for efficiently performing nearest 
neighbor queries on a database of records wherein each 
record has a large number of attributes by automatically 
extracting a multidimensional index from the data. The 
method is based on first obtaining a statistical model of the 
content of the data in the form of a probability density 
function. This density is then used to decide how data should 
be reorganized on disk for efficient nearest neighbor queries. 
At query time, the model decides the order in which data 
should be scanned. It also provides the means for evaluating 
the probability of correctness of the answer found so far in 
the partial scan of data determined by the model. In this 
invention a clustering process is performed on the database 
to produce multiple data clusters. Each cluster is character-
ized by a cluster model. The set of clusters represents a
probability density function in the form of a mixture model.
A new database of records is built having an augmented 
record format that contains the original record attributes and 
an additional record attribute containing a cluster number for 
each record based on the clustering step. The cluster model
uses a probability density function for each cluster so that
the process of augmenting the attributes of each record is
accomplished by evaluating each record's probability with
respect to each cluster. Once the augmented records are used
to build a database, the augmented attribute is used as an
index into the database so that nearest neighbor query
analysis can be conducted very efficiently using an indexed
look-up process. As the database is queried, the probability
density function is used to determine the order in which
clusters or database pages are scanned. The probability density function
is also used to determine when scanning can stop because 
the nearest neighbor has been found with high probability. 

39 Claims, 6 Drawing Sheets 
DENSITY-BASED INDEXING METHOD FOR EFFICIENT EXECUTION OF HIGH
DIMENSIONAL NEAREST-NEIGHBOR QUERIES ON LARGE DATABASES

FIELD OF THE INVENTION

The present invention concerns a database management
system (DBMS) for storing data and retrieving the data
based on a data access language such as SQL. One major use
of database technology is to help individuals and organiza-
tions make decisions and generate reports based on the data
contained within the database. This invention is also appli-
cable to the retrieval of data from non-traditional data sets,
such as images, videos, audio, and mixed multimedia data in
general.

BACKGROUND ART

An important class of problems in the areas of database
decision support and analysis are similarity join problems,
also known as nearest-neighbor queries. The basic problem
is: given a record (possibly from the database), find the set
of records that are "most similar" to it. The term record here
is used in general to represent a set of values; however, the
data can be in any form including image files, or multimedia
files, or binary fields in a traditional database management
system. Applications are many and include: marketing,
catalog navigation (e.g. look-up products in a catalog similar
to another product), advertising (especially on-line), fraud
detection, customer support, problem diagnosis (e.g. for
product support), and management of knowledge bases.
Other applications are in data cleaning applications, espe-
cially with the growth of the data warehousing market.

It has been asserted in the database literature that the only
way to answer nearest neighbor queries for large databases
with high dimensionality (many fields) is to scan the
database, applying the distance measure between the query
object and every record in the data. The primary reason for
this assertion is that traditional database indexing schemes
all fail when the data records have more than 10 or 20 fields
(i.e. when the dimensionality of the data is high). Consider
databases having hundreds of fields. This invention provides
a method that will work with both low dimensional and high
dimensional data. While scanning the database is acceptable
for small databases, it is too inefficient to be practical or
useful for very large databases. The alternative is to develop
an index, and hope to index only a small part of the data
(only a few columns but not all). Without variation, most
(probably all) schemes published in the literature fail to
generalize to high-dimensionality (methods break down at
about 20 dimensions for the most advanced of these
approaches, at 5 or 6 for traditional ones).

This problem is of importance to many applications
(listed above) and generally is a useful tool for exploring a
large database or answering "query-by-example" type que-
ries (e.g., the graphic image closest to a given image). Hence
it can be used as a means for extending the database and
providing it with a more flexible interface that does not
require exact queries (as today's SQL requires). It can also
be used to index image data or other multi-media data such
as video, audio, and so forth.

Most practical algorithms work by scanning a database
and searching for the matches to a query. This approach is
no longer practical when the database grows very large, or
when the server is real-time and cannot afford a long wait
(e.g. a web server). The solution is to create a multi-
dimensional index. The index determines which entries in
the database are most likely to be the "nearest" entries to the
query. The job then is to search only the set of candidate
matches after an index scan is conducted. Note that the
query can be just a record from the database and the answer
is determining other records similar to it. Another example
is an image query where the answer is images "similar" to
this by some defined distance metric.

As an example, consider a database that contains the ages,
incomes, years of experience, number of children, etc. on a
set of people. If it was known ahead of time that queries will
be issued primarily on age, then age can be used as a
clustered-index (i.e. sort the data by age and store it in the
database in that order). When a query requests the entry
whose age value is nearest some value, say 36 years,
one need only visit the relevant parts of the database.
However, indexing rapidly becomes difficult to perform if
one adds more dimensions to be indexed simultaneously. In
fact, as the number of indexes grows, the size of the index
structure dominates the size of the database.

This problem has been addressed in prior work on index
structures (including TV-trees, SS-Trees, SR-Trees, X-Trees,
KD-trees, KD-epsilon-Trees, R-trees, R+-trees, R*-trees,
VA-File) and methods for traversal of these structures for
nearest neighbor queries. Any indexing method has to
answer three questions: 1. how is the index constructed? 2.
how is the index used to select data to scan? and 3. how is
the index used to confirm that the correct nearest neighbor
has been found?

Prior art approaches had serious trouble scaling up to
higher dimensions. Almost always a linear scan of the
database becomes preferable between 20 and 30 dimensions.
For traditional indexing schemes this happens at 5 dimen-
sions or even lower. Statistical analysis of k-d tree and
R-tree type indexes confirms this difficulty; see Berchtold S.,
Bohm C., Keim D., Kriegel H.-P.: "Optimized Processing
of Nearest Neighbor Queries in High-Dimensional Space",
submitted for publication, 1998 and Berchtold S., Bohm C.,
Keim D. and Kriegel H.-P.: "A Cost Model for Nearest
Neighbor Search in High Dimensional Space", ACM PODS
Symposium on Principles of Database Systems, Tucson,
Ariz., 1997. Indexes work by partitioning data into data
pages that are usually represented as hyperrectangles or
spheres. Every data page that intersects the query ball (area
that must be searched to find and confirm the nearest
neighbor point) must be scanned. In high dimensions, for
hyperrectangles and spheres, the query ball tends to intersect
all the data pages. This is known as the "curse of dimen-
sionality". In fact, in a recent paper by Beyer K., Goldstein
J., Ramakrishnan R., Shaft U., "When is Nearest Neighbor
Meaningful?" submitted for publication, 1998, the question
is raised of whether it makes sense at all to think about
nearest neighbor in high-dimensional spaces.

Most analyses in the past have assumed that the data is
distributed uniformly. The theory in this case does support
the view that the problem has no good solution. However,
most real-world databases exhibit some structure and regu-
larity. In fact, a field whose values are uniformly distributed
is usually rare, and typically non-informative. Hence it is
unlikely that one would ask for nearest neighbor along
values of a uniformly distributed field. In this invention the
statistical structure in the database is exploited in order to
optimize data access.

SUMMARY OF THE INVENTION

The present invention exploits structure in data to help in
evaluating nearest neighbor queries. Data records stored on
a storage medium have multiple data attributes that are
described or summarized by a probability function. A near-
est neighbor query is performed by assigning an index for
each of the records based upon the probability function and
then efficiently performing the nearest neighbor query.

In a typical use of the invention, the data is stored with the
help of a database management system and the probability
function is determined by performing a clustering of the data
in the database. The results of the clustering are then used to
create a clustered-index structure for answering nearest
neighbor queries on the data stored in the database. The
clustering identifies groups in the data consisting of ele-
ments that are generally "more similar" to each other than
elements in other groups. The clustering builds a statistical
model of the data. This model is used to determine how the
data should be partitioned into pages and also determines the
order in which the data clusters or pages should be scanned.
The model also determines when scanning can stop because
the nearest neighbor has been found with very high
probability.

Preliminary results on data consisting of mixtures of
Gaussian distributions show that if one knows what the
model is, then one can indeed scale to large dimensions and
use the clusters effectively as an index. Tests have been
conducted with dimensions of 500 and higher. This assumes
that the data meets certain "stability" conditions that ensure
that the clusters are not overlapping in space. These condi-
tions are important because they enable a database design
utility to decide whether the indexing method of this inven-
tion is likely to be useful for a given database. It is also
useful at run-time by providing the query optimizer com-
ponent of the database system with information it needs to
decide the tradeoff between executing an index scan or
simply doing a fast sequential scan.

An exemplary embodiment of the invention evaluates
data records contained within a database wherein each
record has multiple data attributes. A new database of
records is then built having an augmented record format that
contains the original record attributes and an additional
record attribute containing a cluster number for each record
based on the clustering model. Each of the records that are
assigned to a given cluster can then be easily accessed by
building an index on the augmented data record. The process
of clustering and then building an index on the record of the
augmented data set allows for efficient nearest neighbor
searching of the database.

These and other objects, advantages and features of the
invention will become better understood from the detailed
description of an exemplary embodiment of the present
invention which is described in conjunction with the accom-
panying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic depiction of a computer system for
use in practicing the present invention;

FIG. 2A is a schematic depiction of a data mining system
constructed in accordance with an exemplary embodiment
of the present invention;

FIG. 2B is a flow chart documenting the processing steps
of an exemplary embodiment of the invention;

FIGS. 3A and 3B schematically illustrate data clusters;

FIGS. 4A-4D depict data structures used during repre-
sentative clustering processes suitable for use in practicing
the present invention;

FIG. 5 is a depiction in one dimension of three Gaussians
corresponding to the three data clusters depicted in FIGS. 3A
and 3B;

FIGS. 6A and 6B depict subdivisions of a cluster into
segments in accordance with an alternative embodiment of
the invention;

FIGS. 7 and 8 are tables illustrating nearest neighbor
results achieved through practice of an exemplary
embodiment of the invention;

FIG. 9A is a two dimensional illustration of why high
dimensional data records make nearest neighbor inquiries
difficult; and

FIG. 9B is a depiction of three data clusters showing one
stable cluster and two unstable clusters.

DETAILED DESCRIPTION OF EXEMPLARY
EMBODIMENT OF THE INVENTION

The present invention has particular utility for use in
answering queries based on probabilistic analysis of data
contained in a database 10 (FIG. 2A). Practice of the
invention identifies data partitions most likely to contain
relevant data and eliminates regions unlikely to contain
relevant data points. The database 10 will typically have
many records stored on multiple, possibly distributed stor-
age devices. Each record in the database 10 has many
attributes or fields. A representative database might include
age, income, number of years of employment, vested pen-
sion benefits etc. A data mining engine implemented in
software running on a computer 20 (FIG. 1) accesses data
stored on the database and answers queries.

The indexing process depicted in FIG. 2B includes the step
of producing 12 a cluster model. Most preferably this model
provides a best-fit mixture-of-Gaussians used in creating a
probability density for the data. Once the cluster
model has been created, an optimal Bayes decision step 13
is used to assign each data point from the database 10 to
a cluster. Finally the data is sorted by their cluster assign-
ment and used to index 14 the database or optionally create
an augmented database 10' (FIG. 2A) having an additional
attribute for storing the cluster number to which the data
point is assigned. As an optional step in the indexing process
one can ask whether it makes sense to index based upon
cluster number. If the data in the database produces unstable
clusters as that term is defined below, then a nearest neigh-
bor query using probability information may make little
sense and the indexing on cluster number will not be
conducted.

One use of the clustering model derived from the database
is to answer nearest neighbor queries concerning the data
records in the database. FIG. 2B depicts a query analysis
component QC of the invention. Although both the cluster-
ing component C and the query component QC are depicted
in FIG. 2B, it is appreciated that the clustering can be
performed independently of the query. The query analysis
component QC finds with high probability a nearest neigh-
bor (NN) of a query point Q presented as an input to the
query analysis component. The nearest neighbor of Q is then
found in one of two ways. A decision step 15 determines
whether a complete scan of the database is more efficient
than a probabilistic search for the nearest neighbor (NN). If
the complete scan is more efficient, the scan is performed 16
and the nearest neighbor identified. If not, a region is chosen
17 based on the query point and that region is scanned 18 to
determine the nearest neighbor within the region. Once the
nearest neighbor (NN) in the first identified region is found,
a test is conducted 19 to determine if the nearest neighbor
has been determined with a prescribed tolerance or degree of
certainty. If the prescribed tolerance is not achieved a branch
is taken to identify additional regions to check for a nearest
neighbor or neighbors. Eventually the nearest neighbor or
neighbors are found with acceptable certainty and the results 
are output from the query analysis component QC. 
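
The region-by-region portion of the query flow of FIG. 2B (steps 17, 18 and 19) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the names `nearest_neighbor_query`, `prob_closer` and `tolerance` are hypothetical, the probability estimate is supplied by the caller rather than derived from the cluster model, and decision step 15 (complete scan versus probabilistic search) is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_query(q, regions, first_region, prob_closer, tolerance=0.01):
    """Scan regions one at a time until further scanning looks unnecessary.

    regions:      dict mapping a region (cluster) id to its list of points
    first_region: id of the region chosen for the query point (step 17)
    prob_closer:  hypothetical callable (region_id, q, r) -> estimated
                  probability that the region holds a point within r of q
    tolerance:    stop once no unscanned region exceeds this probability
    """
    best, best_dist = None, math.inf
    scanned = []
    region = first_region
    while region is not None:
        # Step 18: scan the chosen region and track the closest point so far.
        for p in regions[region]:
            d = euclidean(p, q)
            if d < best_dist:
                best, best_dist = p, d
        scanned.append(region)
        # Step 19: is any unscanned region still likely to hold a closer point?
        remaining = [rid for rid in regions if rid not in scanned]
        region = None
        if remaining:
            candidate = max(remaining, key=lambda rid: prob_closer(rid, q, best_dist))
            if prob_closer(candidate, q, best_dist) > tolerance:
                region = candidate
    return best, best_dist, scanned
```

The function returns the best point found, its distance from the query, and the list of regions that were actually scanned, so a caller can see how much of the data was visited.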

To illustrate the process of finding a nearest neighbor
outlined in FIG. 2B, consider the data depicted in FIGS. 3A
and 3B. FIG. 3A is a two dimensional depiction showing a
small sampling of data points extracted from the database
10. Such a depiction could be derived from a database
having records of the format shown in Table 1:

TABLE 1

EmployeeID     Age   Salary   Years Employed   Vested Pension   Other Attributes
xxx-xx-xxxx    46    39K      15               100K
YYY-YY-YYYY    40    59K      4                0K
QQQ-QQ-QQQQ    57    88K      23               550K

The two dimensions that are plotted in FIG. 3A are years
of employment and salary in thousands of dollars. One can
visually determine that the data in FIG. 3A is lumped or
clustered together into three clusters Cluster1, Cluster2, and
Cluster3. FIG. 3B illustrates the same data points depicted in
FIG. 3A and also illustrates an added data point or data
record designated Q. A standard question one might ask of
the data mining system 11 would be what is the nearest
neighbor (NN) to Q in the database 10? To answer this
question in an efficient manner that does not require a
complete scan of the entire database 10, the invention
utilizes knowledge obtained from a clustering of the data in
the database.
Database Clustering 

One process for performing the clustering step 12 of the
data stored in the database 10 suitable for use by the
clustering component uses a K-means clustering technique
that is described in co-pending United States patent appli-
cation entitled "A scalable method for K-means clustering of
large Databases" that was filed in the United States Patent
and Trademark Office on Mar. 17, 1998 under application
Ser. No. 09/042,540, now U.S. Pat. No. 6,012,058, and
which is assigned to the assignee of the present application
and is also incorporated herein by reference.

A second clustering process suitable for use by the
clustering component 12 uses a so-called Expectation-
Maximization (EM) analysis procedure. EM clustering is
described in an article entitled "Maximum likelihood from
incomplete data via the EM algorithm", Journal of the Royal
Statistical Society B, vol 39, pp. 1-38 (1977). The EM
process estimates the parameters of a model iteratively,
starting from an initial estimate. Each iteration consists of an
Expectation step, which finds a distribution for unobserved
data (the cluster labels), given the known values for the
observed data, followed by a Maximization step, which
re-estimates the model parameters given that distribution.
Co-pending patent application entitled "A
Scalable System for Expectation Maximization Clustering
of Large Databases" filed May 22, 1998 under application
Ser. No. 09/083,906 describes an EM clustering procedure.
This application is assigned to the assignee of the present
invention and the disclosure of this patent application is
incorporated herein by reference.
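
The iterative structure of EM can be illustrated with a small one-dimensional sketch in Python. This is a textbook-style illustration only and is not the scalable procedure of the referenced application; the function names `em_step` and `fit_em`, the equal initial weights, and the small variance floor are assumptions made for the example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Height of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a one-dimensional mixture of Gaussians."""
    k = len(means)
    # Expectation step: distribution over the unobserved cluster labels.
    resp = []
    for x in data:
        heights = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(heights)  # assumed non-zero for the data used here
        resp.append([h / total for h in heights])
    # Maximization step: re-estimate weight, mean and variance per cluster.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # floor keeps the variance positive
    return new_w, new_m, new_v

def fit_em(data, means, variances, iterations=50):
    """Run a fixed number of EM iterations from an initial estimate."""
    k = len(means)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
    return weights, means, variances
```

A fixed iteration count is used here for brevity; a practical procedure would instead stop when the parameters or the likelihood stop changing appreciably.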

In an expectation maximization (EM) clustering analysis,
rather than harshly assigning each data point in FIG. 3A to
a cluster and then calculating the mean or average of that
cluster, each data point has a probability or weighting factor
that describes its degree of membership in each of the K
clusters that characterize the data. For the EM analysis used
in conjunction with an exemplary embodiment of the present
invention, one associates a Gaussian distribution of data
about the centroid of each of the K clusters in FIG. 3A. EM
is preferred over K-means since EM produces a more valid
statistical model of the data. However, clustering can be
done using any other clustering method, and then the cluster
centers can be parametrized by fitting a Gaussian on each
center and estimating a covariance matrix from the data. EM
gives a fully parameterized model, and hence is
presently the preferred procedure.
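
As a sketch of how a Gaussian might be fitted to each cluster after some other clustering method has assigned the points, the following Python function estimates a mean and covariance matrix per cluster. The function name `fit_gaussians`, the dictionary layout of the result, and the use of the maximum-likelihood (divide by n) covariance estimate are assumptions made for illustration.

```python
def fit_gaussians(points, labels):
    """Estimate a mean vector and covariance matrix for each labelled cluster.

    points: list of equal-length numeric tuples
    labels: list of cluster ids, one per point (hard assignments)
    """
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    model = {}
    for c, members in clusters.items():
        n = len(members)
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / n for d in range(dim)]
        # Covariance entry (a, b): average product of deviations from the mean.
        cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in members) / n
                for b in range(dim)]
               for a in range(dim)]
        model[c] = {"count": n, "mean": mean, "cov": cov}
    return model
```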

Consider the one dimensional depiction shown in FIG. 5.
The three Gaussians G1, G2, G3 represent three clusters that
have centroids or means X1, X2, X3 in the salary attribute
of 42K, 58K, and 78K dollars per year. The compactness of
the data within a cluster is generally indicated by the shape
of the Gaussian and the average value of the cluster is given
by the mean. Now consider the data point identified on the
salary axis as the point "X" of a data record having a salary
of $45,000. The data point "belongs" to all three clusters
identified by the Gaussians. This data point "belongs" to the
Gaussian G2 with a weighting factor proportional to h2
(probability density value) that is given by the vertical
distance from the horizontal axis to the curve G2. This same
data point X "belongs" to the cluster characterized by the
Gaussian G1 with a weighting factor proportional to h1
given by the vertical distance from the horizontal axis to the
Gaussian G1. The point X belongs to the third cluster
characterized by the Gaussian G3 with a negligible weight-
ing factor. One can say that the data point X belongs
fractionally to the two clusters G1, G2. The weighting factor
of its membership to G1 is given by h1/(h1+h2+Hrest);
similarly it belongs to G2 with weight h2/(h1+h2+Hrest).
Hrest is the sum of the heights of the curves for all other
clusters (Gaussians). Since the height in other clusters is
negligible one can think of a "fraction" of the case belonging
to cluster 1 (represented by G1) while the rest belongs to
cluster 2 (represented by G2). For example, if h1=0.13 and
h2=0.03, then 0.13/(0.13+0.03), or approximately 0.8, of the
case belongs to cluster 1, while approximately 0.2 of it
belongs to cluster 2.
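
The fractional membership computation can be written out as a short Python sketch. The helper names `gaussian_height` and `membership_weights` are hypothetical, and any standard deviation passed to `gaussian_height` is an assumption, since the text gives only the cluster means.

```python
import math

def gaussian_height(x, mean, std):
    """Probability density value (curve height) of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def membership_weights(heights):
    """Normalize per-cluster heights so that the weights sum to one."""
    total = sum(heights)
    return [h / total for h in heights]
```

With the heights used in the text, `membership_weights([0.13, 0.03, 0.0])` gives weights of about 0.81 and 0.19 for the first two clusters and zero for the third. Heights for the salary example could be obtained with `gaussian_height(45.0, m, s)` for each mean `m` in (42.0, 58.0, 78.0) and an assumed spread `s`.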

The invention disclosed in the two above-referenced
co-pending patent applications to Fayyad et al. brings data
from the database 10 into a computer memory 22 (FIG. 1)
and the clustering component 12 creates an output model 14
from that data. The clustering model 14 provided by the
clustering component 12 will typically fit in the memory of
a personal computer.

FIGS. 4A-4D illustrate data structures used by the
K-means and EM clustering procedures disclosed in the
aforementioned patent applications to Fayyad et al. The data
structures of FIGS. 4A-4C are used by the clustering
component 12 to build the clustering model 14 stored in a
data structure of FIG. 4D. Briefly, the component 12 gathers
data from the database 10 and brings it into a memory region
that stores vectors of the data in the structure 170 of FIG. 4C.
As the data is evaluated it is either summarized in the data
structure 160 of FIG. 4A or used to generate sub-clusters that
are stored in the data structure 165 of FIG. 4B. Once a
stopping criterion that is used to judge the sufficiency of the
clustering has been met, the resultant model is stored in
a data structure such as the model data structure of FIG. 4D.
Probability Function 

Each of K clusters in the model (FIG. 4D) is represented
or summarized as a multivariate Gaussian having a prob-
ability density function:

Equation 1:

$$p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$

where $x=(x_1,x_2,x_3,x_4,\ldots,x_n)$ is an n-component column
matrix corresponding to a data point in the selected n
dimensional space of the database, $\mu$ is the n-component
column matrix corresponding to a data structure 154 having
the means (averages) of the data belonging to the cluster in
each of the n dimensions (designated SUM in FIG. 4D) and
sigma ($\Sigma$) is an n-by-n covariance matrix that relates how the
values of attributes in one dimension are related to the values
of attributes in other dimensions for the points belonging to
the cluster. The transpose of a matrix $\Sigma$ is represented by $\Sigma^{T}$,
and the inverse of a matrix $\Sigma$ is represented by $\Sigma^{-1}$. The
determinant of a matrix $\Sigma$ is represented by $|\Sigma|$. The cova-
riance matrix is always symmetric. The depiction of FIG. 6
represents this Gaussian for the two clusters G1, G2 in one
dimension.

The number of values required to represent each cluster is
the sum of the following quantities: the number N (one
number) indicating the data records summarized in a given
cluster. (In K-means clustering this is an integer; in EM
clustering, a floating point number.) The dimension n equals
the number of attributes in the data records and is equal to
the width of a SUM data structure (FIG. 4D) of the model
14. There are n*(n+1)/2 values for the covariance matrix $\Sigma$
which give a total of 1+n+[n*(n+1)]/2 values in all. If the
covariance matrix is diagonal (FIG. 4D for example), then
there are n numbers in the covariance matrix 156 and the
number of values needed to characterize the cluster is
reduced to 1+2n. It is also possible to represent a full
covariance matrix (not necessarily diagonal) if space allows.

Returning to the example of FIGS. 3A and 3B, one would
in principle need to scan all data points to find the nearest
neighbor point (NN) closest to query point Q. Instead of
scanning the entire database, use of the clustering model,
however, allows the nearest neighbor query process to scan
only cluster #2. This avoids scanning 66% of data in the
database assuming each cluster has 33% of data in it. In a
situation wherein the cluster number K is larger the process
is even more efficient.

Scanning of a cluster for the nearest neighbor implies a
knowledge of cluster assignment for all points in the data-
base 10. The properties summarized in FIGS. 7 and 8 allow
a probability density based indexing method. The data is
modeled as a mixture of Gaussians. A clustering algorithm
such as scalable K-means or EM is used to cluster the data.
The clustering allows a probability density function for the
database to be calculated.

The model for each cluster is a Gaussian. Recall that each
cluster l has associated with it a probability function of the
form:

Equation 2:

$$p(x \mid l) = \frac{1}{(2\pi)^{n/2}\sqrt{|\Sigma^{l}|}} \exp\left(-\frac{1}{2}(x-\mu^{l})^{T}\,(\Sigma^{l})^{-1}\,(x-\mu^{l})\right)$$

where $\mu^{l}$ is the mean of the cluster, and $\Sigma^{l}$ designates the
covariance matrix. It is assumed the data is generated by a
weighted mixture of these Gaussians. A distance measure is
assumed to be of the following form:

Equation 3:

$$Dist(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$

where x and y are data vectors. The invention supports a
fairly wide family of distance measures. The assumption is
made that the distance is of quadratic normal form, i.e. it
can be written as: $Dist(x,q)=(x-q)^{T}D(x-q)$ with D being a
positive semi-definite matrix, i.e. for any vector x, $x^{T}Dx \ge 0$.
For Euclidean distance, D is diagonal with all entries being
1.

Use of a Euclidean weighted distance measure does not
change the results other than a pre-scaling of the input space:

Equation 4:

$$Dist(x,y) = \sqrt{\sum_{i=1}^{n} w_i\,(x_i-y_i)^2}$$

The use of a weighting factor $w_i$ allows certain dimensions of
the n attributes to be favored or emphasized in the nearest
neighbor determination. When the distance from a data point
to a cluster with center $\mu$ is computed, the Euclidean distance is used:

Equation 5:

$$Dist(x,\mu) = \sqrt{\sum_{i=1}^{n}(x_i-\mu_i)^2}$$

A cluster membership function is given by:

Equation 6:

$$L(x) = \arg\max_{l}\; p(l \mid x)$$

Equation 6 assigns a data point x to the cluster with highest
probability. This step 13 is known in the literature as an
optimal Bayes decision rule (FIG. 2). The data points are
partitioned into regions corresponding to predicted cluster
membership.

The clustering step is followed by a step of creating a new
column in the database that represents the predicted cluster
membership. In accordance with the exemplary embodiment
of the present invention, the database 10 is rebuilt using the
newly added column, and the cluster number forms the basis of
a CLUSTERED INDEX as that term is used in the database
field. Table 4 (below) illustrates the records of Table 1 with
an augmented field or attribute of the assigned cluster
number.

TABLE 4

EmployeeID     Age   Salary   Years Employed   Vested Pension   Cluster Number
xxx-xx-xxxx    46    39K      15               100K             #1
YYY-YY-YYYY    40    59K      4                0K               #2
QQQ-QQ-QQQQ    57    88K      23               550K             #3

A new database 10' (FIG. 2) is created that includes a
clustered index based upon the augmented attribute of
cluster number that is assigned to each data point in the
database.

At query time, the invention scans the region or cluster
most likely to contain the nearest-neighbor. The scanning is
repeated until the estimated probability that the nearest-
neighbor is correct exceeds some user defined threshold.
Typically only a very small number of clusters or regions
will need to be visited. The approach is applicable to
K-nearest neighbor queries as well since relatively little
additional overhead is needed to find the closest K data
points within the cluster to which the data point Q is
assigned. This invention also supports a file-based imple-
mentation and does not require a database system. In this
case, instead of writing a new column, the data from each
cluster is written into a separate file. The subsequent dis-
cussion refers to the extra column attribute, but simply
replacing the step "scan cluster X" with "read data from file
X" is an equivalent implementation of this invention.

When a query is to be processed, it is assumed the input
query of the point Q is a data record. The probability density
model that is based upon the cluster model 14 of FIG. 4D is
used to determine cluster membership for the data point Q.
The process then scans the cluster most likely to contain the
nearest neighbor (NN) based on a probability estimate. If the
probability that the nearest neighbor estimate has been found
is above a certain threshold, the process returns the nearest
neighbor based upon the scan of the cluster; if not, a next
most likely cluster is scanned. The distance to the nearest
neighbor found so far in the scan is tracked. This distance
defines a region around the query point Q which is desig-
nated the Q-ball in FIG. 3B. The choice of distance metric
determines the shape of the ball. The next cluster with which
we have the highest probability of encountering a nearest
neighbor is next scanned. This probability is computed using
a linear combination of non-central chi-squared distributions
which approximates the probability of a point belonging to
the cluster falling within the Q-ball (say cluster 1 in FIG.
3B). If the probability is smaller than some threshold ε, for
example, the scan is terminated since this indicates the
likelihood of finding nearer points is vanishingly small.

Exemplary Process and its Probabilistic Analysis

This section describes in greater detail the process intro-
duced above for answering nearest-neighbor queries in
sufficient detail to permit analysis. The process includes two
main components. The first component takes a data set D as
input and constructs an index that supports nearest-neighbor
queries. The second component takes a query point q and
produces the nearest neighbor of that point with high prob-
ability.

The index is constructed in three steps.
1. Produce a best-fit mixture-of-Gaussians probability

and a positive definite covariance matrix $\Sigma_i$. The optimal
decision rule for a mixture of Gaussians pdf is presented in
Duda and Hart, Pattern Classification and Scene Analysis,
John Wiley and Sons, New York, 1973. It dictates that a data
point x is assigned to Cluster $C_l$ if i=l maximizes the
quantity:

Equation 8:

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{T}\,\Sigma_i^{-1}\,(x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log p_i$$

Note that this decision rule maximizes the probability that
x is generated by the l-th Gaussian distribution in Equation
7 (cluster $C_l$) and in that sense it is an optimal decision. This
decision rule defines a set of K regions $R_1, \ldots, R_K$ in
high-dimensional Euclidean space where $R_i$ contains the set
of points assigned to the cluster $C_i$. Define R(x) to be the
region-identification function; that is, each data point x
belongs to region R(x) where R(x) is one of the K regions
$R_1, \ldots, R_K$. Finally, an index I(x) is defined such that I(x)=i
if R(x)=$R_i$; namely, I has K possible values which indicate
to which cluster the optimal Bayes rule assigns x. The
sorting of the database by I(x) can be implemented by
adding a new attribute to the database and by building
clustered indexes based on the cluster number.

The second component of the invention is activated at
query time. It finds with high probability a nearest neighbor
of a query point, denoted by q. For its description, use
the notation B(q,r) to denote a sphere centered on q and
having radius r. Let E denote the set of regions scanned so
far, and let e denote the knowledge learned by scanning the
regions in E. A nearest neighbor of q is found as
follows:

NearestNeighbor (q; $R_1, \ldots, R_K$)
1. Let $C_l$ be the cluster assigned to q using the optimal
   decision rule (Equation 8);
2. Scan data in $R_l$ and determine the nearest neighbor n(q)
   in $R_l$ and its distance r from q;
3. Set E = {$R_l$};
4. While P(B(q,r) is Empty | e) < tolerance {
   a. find a cluster $C_j$ not in E which minimizes P(B(q,r)
      ∩ $R_j$ is Empty | e);
   b. scan the data in $R_j$; Set E = E ∪ {$R_j$};
   c. if a data point closer to q is found in $R_j$, let n(q) be
      that point, and set r to be the new minimum distance.
   }

The quantity P(B(q,r) is Empty | e) is the probability that
B(q,r), which is called the query ball, is empty, given the

density function (pdf) for the data; evidence e collected so far. The evidence consists simply of 

2 Use the optimal Bayes decision rule to assign each data the list of points included in the regions scanned so far. 

point to a cluster; and Before we show how to compute this quantity and how to 

3. Sort the data points by their cluster assignment. 50 compute Pr(B(q,r)nRi isEmpty|e);, the process is explained 

There are many ways to find a mixture-of-Gaussians pdf that using the simple example depicted in FIGS. 3A and 3B. In 

fits data (e.g., Thiesson et al., 1998,). The exemplary process this example, the optimal Bayes rule generated three regions 

uses a scalable EM clustering scheme (See Fayyad et al Rl, R2, and R3 whose boundaries are shown. A given query 

pending patent application) that was developed to classify point Q is found to reside in region Rl. The algorithm scans 

large databases for a variety of applications other than 55 Rl, a current nearest neighbor (NN) is found, a current 

nearest-neighbor queries. minimum distance r is determined, and a query ball B(q,r) is 

The outcome of this algorithm is a mixture-of-Gaussians pdf formed. The query ball is shown in FIG. 3B. If the process 

of the form: was deterministic it would be forced to scan the other 

regions, R2 and R3 since they intersect the query ball, 

g-y^jj 7. 60 Instead the process determines the probability that the ball is 

empty given the fact that the region Rl has been scanned, 

i Suppose this probability exceeds the tolerance. The algo- 

/(*) - 2j^'' c ( x I w. £f) rithm must now choose between scanning R2 and scanning 

R3. A choice should be made according to the region that 

65 maximizes the probability of finding a nearer neighbor once 

where p, are the mixture coefficients, 0<pj<l,2!p,Bl, and that region is scanned, namely, the algorithm should scan the 

G(x|^ l 2 1 ) is a multivariate Gaussian pdf with a mean vector region that minimizes P(B(q,r)ORi is Empty|e). This quan- 



01/20/2004, EAST Version: 1.4.1 



US 6,2 

11 

tity is hard to compute and so the algorithm approximate this 
quantity using equation 9 below. 
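The index construction and the query loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each cluster is a hypothetical tuple `(p, mu, var)` with a diagonal covariance, the discriminant is the usual Gaussian log-discriminant (which is how Equation 2 is read here), and `prob_region_empty` is a caller-supplied estimate of P(B(q,r) ∩ R_j is Empty). The sketch reads the stopping test as "keep scanning while the estimated probability that some unscanned point lies inside the query ball exceeds the tolerance"; that reading is an assumption.

```python
import math

def discriminant(x, p, mu, var):
    """Gaussian log-discriminant for a diagonal covariance (assumed form of Equation 2)."""
    quad = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mu, var))
    log_det = sum(math.log(vj) for vj in var)
    return -0.5 * quad - 0.5 * log_det + math.log(p)

def bayes_assign(x, clusters):
    """Return the index of the cluster that maximizes the discriminant."""
    return max(range(len(clusters)), key=lambda i: discriminant(x, *clusters[i]))

def build_index(data, clusters):
    """Group points by assigned cluster; stands in for sorting the data on I(x)."""
    index = {i: [] for i in range(len(clusters))}
    for x in data:
        index[bayes_assign(x, clusters)].append(x)
    return index

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor(q, index, clusters, prob_region_empty, tolerance=0.01):
    """Scan regions in order of how likely they are to hold a nearer point."""
    best, r = None, math.inf
    scanned = []                          # the set E of regions scanned so far
    j = bayes_assign(q, clusters)         # step 1: region of the query point
    while True:
        for x in index[j]:                # steps 2 and 4b: scan region j
            d = dist(x, q)
            if d < r:                     # step 4c: keep the closest point
                best, r = x, d
        scanned.append(j)
        remaining = [i for i in index if i not in scanned]
        if not remaining:
            break
        # Estimated probability, per unscanned region, that it holds no point in B(q, r).
        empties = {i: prob_region_empty(i, q, r) for i in remaining}
        if 1.0 - math.prod(empties.values()) <= tolerance:
            break                         # a nearer point is unlikely to remain
        j = min(empties, key=empties.get) # step 4a: least likely to be empty
    return best, r, scanned

# Toy data: two well-separated clusters with equal weights (hypothetical values).
clusters = [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (10.0, 10.0), (1.0, 1.0))]
data = [(0.0, 0.0), (1.0, 0.5), (-0.5, 0.3), (10.0, 10.0), (9.5, 10.2)]
index = build_index(data, clusters)

def crude_prob_empty(j, q, r):
    # Hypothetical stand-in: a region counts as empty of nearer points when
    # its mean lies more than r plus 3 standard deviations away from q.
    _, mu, var = clusters[j]
    return 1.0 if dist(q, mu) > r + 3.0 * math.sqrt(max(var)) else 0.0

best, r, scanned = nearest_neighbor((0.2, 0.1), index, clusters, crude_prob_empty)
print(best, r, scanned)
```

In this toy run the query near the first cluster is answered after scanning a single region, while a query midway between the two clusters would force both regions to be scanned.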

It is shown how to compute the approximation and analyze the
difference between the computed quantity and the desired one. In this
example, region R2 is selected to be scanned. The process then halts
because P(B(q,r) is Empty|e) becomes negligible once R1 and R2 have
been scanned. The basis for computing P(B(q,r) is Empty|e) is the
fact that data points are assumed to be randomly generated using the
mixture-of-Gaussians pdf f (Equation 1). In other words, take f to be
the true model that generated the database. The sensitivity of the
process to this assumption must be tested using real data sets. The
probability the ball is empty is the product of the probabilities
that each point does not fall in the ball, because, according to the
assumption, the x_i's are iid samples from f(x). Consequently, one
has:

Equation 9:

P(B(q,r) is Empty | e) = \prod_{i=1}^{n} [1 - P(x_i \in B(q,r) | x_i \in R(x_i))]

Where R(x_i) is the region of x_i and n is the number of data points.
If R(x_i) has been scanned then P(x_i \in B(q,r) | x_i \in R(x_i)) =
0. In general, P(x_i \in B(q,r) | x_i \in R(x_i)) is not computable.
Fortunately, it is acceptable to use the approximation:

Equation 10:

P(x_i \in B(q,r) | x_i \in R(x_i)) \approx P(x_i \in B(q,r) | x_i generated by C_j)

Where C_j is the cluster assigned to x_i using the optimal Bayes
decision rule.

Using the same reasoning as above and the fact that
P(x_i \in B(q,r) \cap R_j | x_i \in R(x_i)) = P(x_i \in B(q,r) | x_i \in R_j),
then one has:

Equation 11:

P(B(q,r) \cap R_j is Empty | e) = [1 - P(x_i \in B(q,r) | x_i \in R_j)]^{n(R_j)}

Which is approximately [1 - P(x_i \in B(q,r) | x_i generated by C_j)]^{n(R_j)}
where n(R_j) is the number of points falling in region R_j.

The remaining task of computing P(x \in B(q,r) | x generated by C_j)
has been dealt with in the statistical literature in a more general
setting. This probability can be calculated numerically using a
variety of approaches. In particular, numerical approaches have been
devised to calculate probabilities of the form:

Equation 12:

P[(x - q)^T D (x - q) \le r^2]

x is a data point assumed to be generated by a multivariate Gaussian
distribution G(x | \mu, \Sigma). D is a positive semi-definite
matrix. In case Euclidean distance is used to measure distances
between data points, D is simply the identity matrix. However, the
numerical methods apply to any distance function of the form
d(x,q) = (x - q)^T D (x - q) where D is a positive semi-definite
matrix.

The pdf of the random variable (x - q)^T D (x - q) is a chi-squared
distribution when D = I and q = \mu. It is a noncentral chi-squared
distribution when D = I and q is not equal to \mu. It is a sum of
noncentral chi-squared pdfs in the general case of a positive
semi-definite quadratic form, i.e. x^T D x \ge 0 for any x. The
general method uses a linear transformation to reduce the problem to
calculating the cumulative distribution of a linear combination of
chi-squares. The invention uses a method of Sheil and
O'Muircheartaigh. (See Quadratic Forms in Random Variables. Marcel
Dekker, Inc., New York, 1992) The computation is illustrated assuming
that the dimensions are uncorrelated (i.e. \Sigma is diagonal) and
that the Euclidean distance metric is used (D is the identity
matrix), but the needed probability is computable for the general
case. In the general case:

Equation 13:

(x - q)^T D (x - q) = \sum_{j=1}^{n} (x_j - q_j)^2

Where q_j is the jth component of q and x_j is transformed to a
standard normal random variable z_j of the form:

Equation 14:

z_j = (x_j - \mu_j) / \sigma_j

Where \delta_j = (q_j - \mu_j) / \sigma_j and \mu_j is the jth
component of \mu. Now (z_j - \delta_j)^2 has a noncentral chi-square
distribution with 1 degree of freedom and noncentrality parameter
\delta_j^2. The final result is:

Equation 15:

P[(x - q)^T D (x - q) \le r^2] = P[\sum_{j=1}^{n} \sigma_j^2 (z_j - \delta_j)^2 \le r^2]

This cumulative distribution function (cdf) can be expanded into an
infinite series. The terms of the series are calculated until an
acceptable bound on the truncation error is achieved.
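As a hedged illustration of a truncated series for this probability, the sketch below covers only the special case in which every dimension has the same standard deviation sigma. Under that assumption the weighted sum collapses to a single noncentral chi-square variable with one degree of freedom per dimension, whose cdf is a Poisson-weighted series of central chi-square cdfs. This is not the Sheil and O'Muircheartaigh procedure for the general weighted case; the function names and the tolerance are illustrative choices.

```python
import math

def chi2_cdf(x, k):
    """Central chi-square cdf with k degrees of freedom, from the power
    series of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, h = k / 2.0, x / 2.0
    total, n = 0.0, 0
    while True:
        term = math.exp((a + n) * math.log(h) - h - math.lgamma(a + n + 1))
        total += term
        if n > h and term < 1e-15:   # terms decrease once n exceeds h
            break
        n += 1
    return min(total, 1.0)

def ncx2_cdf(x, k, lam, tol=1e-12):
    """Noncentral chi-square cdf as a Poisson mixture of central cdfs.
    The unused Poisson mass bounds the truncation error."""
    if lam == 0:
        return chi2_cdf(x, k)
    total, weight_sum, j = 0.0, 0.0, 0
    log_half_lam = math.log(lam / 2.0)
    while j < 10000:                 # safety cap on the number of terms
        w = math.exp(-lam / 2.0 + j * log_half_lam - math.lgamma(j + 1))
        total += w * chi2_cdf(x, k + 2 * j)
        weight_sum += w
        if 1.0 - weight_sum < tol:
            break
        j += 1
    return total

def prob_point_in_ball(q, r, mu, sigma):
    """P(x in B(q, r)) for x drawn from N(mu, sigma^2 I), Euclidean distance."""
    lam = sum((qj - mj) ** 2 for qj, mj in zip(q, mu)) / sigma ** 2
    return ncx2_cdf(r * r / sigma ** 2, len(q), lam)

def prob_region_empty(p_in_ball, n_points):
    """Approximation of Equation 11: no point of the region falls in the ball."""
    return (1.0 - p_in_ball) ** n_points

p = prob_point_in_ball((3.0, 0.0), 1.5, (0.0, 0.0), 1.0)
print(p, prob_region_empty(p, 1000))
```

The last two lines show how the per-point probability feeds the region-level estimate; the sample query, radius and point count are made up for the illustration.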
Cluster Stability

Traditional (prior art) indexing methods for performing nearest
neighbor searches conclude that scanning of the entire database must
be used when the number of dimensions of the data increases. In
Berchtold et al "A cost model for nearest neighbor search in
high-dimensional data space", ACM PODS Symposium on Principles of
Database Systems, Tucson Ariz., 1997, an approximate nearest neighbor
model based on an assumed uniform distribution of data found that in
high dimensions a full scan of the database would be needed to find
the nearest neighbor of a query point. A key assumption of the
Berchtold et al work is that if a nearest-neighbor query ball
intersects a data page the data page must be visited in the nearest
neighbor search.

Consider the example illustrated in FIG. 9A. This figure assumes a
set of data points resides densely in a d-dimensional sphere of
radius r and that to perform a nearest neighbor search a minimum
bounding rectangle is constructed around this sphere. The two
dimensional case is shown on the left of FIG. 9A. Note the ratio of
the volume of the inscribed sphere to the volume of the minimum
bounding box used to perform the nearest neighbor search converges
rapidly to 0 as d goes to infinity. Define the 'corners' of the
bounding box as the regions outside the sphere but still in the box.
When d (dimensionality) is large, most of the volume is in the
corners of the boxes and yet there is no data in these corners.
FIG. 9A at the right suggests the growth of the corners. In fact both
the number and size of the corners grow exponentially with
dimensionality.
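The shrinking volume ratio can be checked numerically. The sketch below uses the standard closed form for the volume of a d-dimensional ball; the function names are illustrative, and the radius cancels out of the ratio.

```python
import math

def sphere_to_box_volume_ratio(d):
    """Volume of a d-ball of radius r divided by the volume of its
    bounding cube of side 2r (independent of r)."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

def corner_count(d):
    """A d-dimensional box has 2**d vertices, one per 'corner' region."""
    return 2 ** d

for d in (2, 3, 10, 20, 50):
    print(d, sphere_to_box_volume_ratio(d), corner_count(d))
```

In two dimensions the ratio is pi/4, in three it is pi/6, and by ten dimensions the sphere already occupies well under one percent of the box.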

Any fixed geometry will never fit real data tightly. In high 
dimensions, such discrepancies between volume and density 
of data points start to dominate. In conclusion, indices based 
on bounding objects are not good enough, because they 
frequently scan data pages where no relevant data resides. 
By introducing the concept of cluster stability the dimensionality
problem is addressed and used to determine when
scanning of the database is necessary. If the data in the entire 
database is generated from a single Gaussian, then scanning of the
database is necessary. In a mixture of Gaussians, each data point or
query is generated by a single Gaussian. The points that are being
clustered can be thought of as being generated by the Gaussian that
generated them. It is assumed that the data point and the query
points are generated by the same distribution. If the distance
between data points in the same cluster approaches the mean cluster
separation distance as the dimensionality increases the clusters are
said to be unstable since every point is the same distance apart.
Similarly the distance between any two points from two distinct
clusters approaches the mean between cluster distance.

If for two clusters, the between cluster distance dominates the
within cluster distance, the clusters are stable with respect to each
other. In FIG. 9B the points in cluster 1 are stable and in clusters
2 and 3 they are unstable.

Assume q_d and x_{d,i} are generated by cluster i and x_{d,j} is
generated by cluster j. If

\frac{\|x_{d,i} - q_d\|^2}{E(\|x_{d,i} - q_d\|^2)} \to_p 1 and \frac{\|x_{d,j} - q_d\|^2}{E(\|x_{d,i} - q_d\|^2)} \to_p \delta > 1

Then the clusters i and j are pairwise stable with parameter \delta,
and for any e > 0:

\lim_{d \to \infty} P(\|x_{d,j} - q_d\|^2 \ge (\delta - e) \|x_{d,i} - q_d\|^2) = 1

If every cluster is stable with respect to at least one other cluster
then its nearest neighbor search will be well defined. With
probability 1, the ratio of the farthest and nearest neighbors is
bigger than some constant greater than 1. For example in FIG. 9A,
Cluster 1 is stable with respect to both Clusters 2 and 3 so a
nearest neighbor is well defined.

Assume a query point q_d and a data point x_{d,i} are generated by a
cluster i and a data point x_{d,j} is generated by cluster j. If
cluster i is unstable with itself and there exists a cluster j that
is pairwise stable with i with parameter \delta > 1, then for any
e > 0:

\lim_{d \to \infty} P[DMAX_d \ge (\delta - e) DMIN_d] = 1

If every cluster is stable with respect to every other cluster then
if a point belongs to one cluster, its nearest neighbor also belongs
to that cluster. Therefore if the data is partitioned by cluster
membership then with probability one the index will only need to
visit one cluster to find the nearest neighbor. With probability one
(i.e. certainty) other clusters can be skipped and no false drops of
points in the query occur.

Assume q_d and x_{d,i} are generated by cluster i. If cluster i is
unstable with itself and pairwise \delta_{ij} \ge \delta > 1 stable
with every other cluster j, j not equal to i, then q's nearest
neighbor was also generated by cluster i. Specifically for any point
x_{d,j} from cluster j not in i:

P[\|x_{d,j} - q_d\|^2 \ge (\delta - e) \|x_{d,i} - q_d\|^2] \to 1

These results show that if one has a stable mixture of Gaussians
where the between cluster distance dominates the within cluster
distance, and if a cluster partition membership function is used to
assign all data generated by the same Gaussian to the same partition,
the index can be used for nearest neighbor queries generated by the
same distribution. The higher the dimensionality, the better the
nearest neighbor query works.

Number of Clusters

From the perspective of indexing, the more clusters one has, the less
data one must scan. However, determining which cluster to scan next
requires a lookup into the model. If there are too many clusters,
this lookup becomes too expensive. Consider an extreme case where
each point in the database is its own cluster. In this case no data
will need to be scanned as the model identifies the result directly.
However, the lookup into the model is now as expensive as scanning
the entire database.

Generally, the number of clusters is chosen to be between 5 and 100.
The cost of computing probabilities from the model in this case is
fairly negligible. Note that with 5 clusters, assuming well-separated
clusters, one can expect an 80% savings in scan cost. So not many
clusters are needed. The tradeoff between model lookup time and data
scan cost can be optimized on a per application basis.

Types of Queries Supported

The query data record in nearest-neighbor applications can be either
from the same distribution the data came from or from a different
distribution. In either case, the invention works well since the
probabilities discussed above work for any fixed query points. If the
query is from the data used to generate the probabilities, the
invention will find the nearest neighbor in a scan of one cluster.
This is true if the query comes from the same distribution that the
cluster model is drawn from and the distribution is stable.

Assume one has a cluster model with a set of clusters C, and let
|C_i|, i = 1, ..., K, denote the size of C_i for C_i \in C. The |C|
is the summation of |C_i| over all the C_i. If one defines S_i to be
the proportion of data that are members of C_i, i.e. S_i = |C_i|/|C|,
then we expect that on an average query, only

\sum_{i=1}^{K} S_i^2

of the data will be scanned. The reason for this is that with
probability S_i, the query will come from C_i, and scanning C_i is
equivalent to scanning a portion S_i of the data.

Generally one expects that most queries in high dimension will come
from the same distribution as the data itself. Also, dimensions are
not independent and some combinations of values may not be realistic.
In a database of demographics, for example, one does not expect to
see a find similar query on an entry for a record whose value for age
is 3 and whose income is $50K. An advantage of practice of the
invention is in a situation where data is not stable and a bad query
is requested, the invention realizes the nearest neighbor may reside
in many clusters and the process switches to a sequential scan of all
the data in the database.

Testing of the Exemplary Embodiment

The tables of FIGS. 7 and 8 summarize results of a test of
nearest-neighbor searches using an exemplary embodiment of the
invention. The goal of these computational experiments was to confirm
the validity of practice of the invention. Experiments were conducted
with both synthetic and real world databases. The purpose of the
synthetic databases was to study the behavior of the invention in
well understood situations. The real data sets were examined to
assure the assumptions were not too restrictive and apply in natural
situations.

The synthetic data sets were drawn from a mixture of ten Gaussians.
In one set of tests stable clusters were generated and then unstable
clusters from a known generating model were used. Additionally,
clusters from unknown generation models were used wherein the density
model had to be estimated. What it means for the clusters to be
stable and
unstable are discussed in this application. Briefly, each Gaussian
has a covariance matrix \sigma^2 I where I is the identity matrix.
The dimensionality of the data is given by d and the distance between
means or centroids of the clusters is \tau_d. If
\tau_d > \sigma d^{1/2} then the clusters are stable.

In one experiment, the size of the database was fixed and was chosen
to be 500,000/d for d, the dimension, less than 100. For d greater
than or equal to 100 the database size was 1,000,000/d. The process
of finding the closest two neighbors to a given data point is
performed for 250 query points randomly selected from the database.
In every case the two nearest neighbors were found by examining only
one cluster. In addition the process correctly determined that no
additional clusters needed to be searched. Since each cluster
contained ten percent of the data, each query required a scan of ten
percent of the data. This test confirmed that when the clusters are
stable and the model is known the process of finding nearest
neighbors works very well.

A further experiment tested how the process works when the clusters
are unstable. The amount of overlap of the Gaussians that model the
clusters grows exponentially as the number of dimensions increases.
To evaluate how well the invention works in such a situation, the
databases that were generated were first completely scanned to find
the nearest neighbor and then the invention was used to find the
nearest neighbor. Scanning of the clusters stops once the process
finds the known nearest neighbor. FIG. 7 tabulates the data. The
figure also tabulates an idealized value based not upon how much of
the database must be scanned but the probability based predictor that
enough data has been scanned to assure with a tolerance that the
correct nearest neighbor data point has been found.

Since the data gathered in FIG. 7 was from an unstable situation, one
would expect the process to require a scan of much of the database.
The chart below provides the accuracy, i.e. the percentage of queries
for which no error occurred in finding the nearest neighbor. An error
means the estimated nearest neighbor distance exceeded the actual
nearest neighbor distance.

Dimension   10     20     30     40     50     60     70
Accuracy    98.8   96.4   93.6   93.2   94.8   96.4   92.4

As seen in FIG. 7, the percentage of the data scanned increased
gradually with dimensionality. The ideal process predicted by theory
scanned less data. This difference between the ideal process and the
present invention indicates the invention's probability estimate is
conservative. In the experiments summarized in FIG. 8, the data were
generated from ten Gaussians with means or centroids independently
and identically distributed in each dimension from a uniform
distribution. Each diagonal element of the covariance matrix \Sigma
was generated from a uniform distribution. The data is therefore well
separated and should be somewhat stable. Clusters that were used were
generated using the EM process of Fayyad et al without using
knowledge of the data generating distribution except that the number
of clusters was known and used in the clustering. The results of the
test on 10 to 100 dimensions are given in FIG. 8. The distribution is
stable in practice and the two nearest neighbors were located in the
first clusters and scanning was stopped.

The testing summarized by these Figures shows how much savings is
achieved by scanning the clusters in order of their likelihood of
containing a nearest neighbor and then stopping when the nearest
neighbor is found with high confidence. Note, even if more than one
cluster is scanned to find the nearest neighbor, there is still a
substantial savings over performing a scan of the entire database.
The data in these tables confirms a significant benefit is achieved
by clustering data in a large database if query by example or
nearest-neighbor queries are to be performed.

Alternative Embodiment

Further efficiency can be obtained through careful organization of
the data within each cluster or region. In an alternative embodiment
of the invention, instead of scanning an entire cluster to find the
nearest neighbor, the process scans part of a cluster and then
determines the next part of a cluster to scan [...]. FIG. 6A shows
data [...] in 2 dimensions. In the alternative embodiment of FIG. 6A,
one only needs to visit "slices" of a cluster. Slicing of the
clusters is performed by defining mutual hyperplanes and
equiprobability regions. The cluster slicing process is performed by
slicing each cluster into a set of iso-probability regions S (shells
or donuts), and then using inter-cluster hyperplanes H to cut them
further into subregions. FIG. 6A shows an example of slicing one
cluster into 6 inter-cluster plane regions, and into an additional 5
iso-probability slices, giving 30 parts of the cluster.

Data can also be ordered within a particular cluster. The present
invention is based upon the premise that it is possible to exploit
the structure in data when data is clustered. Within a cluster one
can ask how the data should be scanned. This issue is one of
practical database management since data will have to be laid out on
disk pages. It should be possible to use a stopping criterion within
a cluster and avoid scanning all pages.

Consider a query point that happens to be near the center of a
cluster. Clearly the data near the center should be scanned, while
data within the cluster but further from the center may not have to
be scanned. This suggests a disk organization that is similar to the
one illustrated in FIG. 6B. The rationale behind such a layout is
that regions R get bigger as they get farther from the center of the
cluster. Furthermore, the central region CR should not be partitioned
finely since theory suggests that data points close to the center are
indistinguishable from each other by distance: i.e. they are all
essentially equidistant from a query.

Exemplary Data Processing System

With reference to FIG. 1 an exemplary data processing system for
practicing the disclosed data mining engine invention includes a
general purpose computing device in the form of a conventional
computer 20, including one or more processing units 21, a system
memory 22, and a system bus 23 that couples various system components
including the system memory to the processing unit 21. The system bus
23 may be any of several types of bus structures including a memory
bus or memory controller, a peripheral bus, and a local bus using any
of a variety of bus architectures.

The system memory includes read only memory (ROM) 24 and random
access memory (RAM) 25. A basic input/output system 26 (BIOS),
containing the basic routines that help to transfer information
between elements within the computer 20, such as during start-up, is
stored in ROM 24.

The computer 20 further includes a hard disk drive 27 for reading
from and writing to a hard disk, not shown, a magnetic disk drive 28
for reading from or writing to a removable magnetic disk 29, and an
optical disk drive 30 for reading from or writing to a removable
optical disk 31 such as a CD ROM or other optical media. The hard
disk drive 27, magnetic disk drive 28, and optical disk drive 30 are
connected to the system bus 23 by a hard disk drive interface 32, a
magnetic disk drive interface 33, and an optical drive interface 34,
respectively. The drives and their associated
computer-readable media provide nonvolatile storage of
computer readable instructions, data structures, program 
modules and other data for the computer 20. Although the 
exemplary environment described herein employs a hard 
disk, a removable magnetic disk 29 and a removable optical 
disk 31, it should be appreciated by those skilled in the art 
that other types of computer readable media which can store 
data that is accessible by a computer, such as magnetic 
cassettes, flash memory cards, digital video disks, Bernoulli 
cartridges, random access memories (RAMs), read only 
memories (ROM), and the like, may also be used in the 
exemplary operating environment. 

A number of program modules may be stored on the hard 
disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, 
including an operating system 35, one or more application 
programs 36, other program modules 37, and program data 
38. A user may enter commands and information into the 
computer 20 through input devices such as a keyboard 40 
and pointing device 42. Other input devices (not shown) 
may include a microphone, joystick, game pad, satellite 
dish, scanner, or the like. These and other input devices are 
often connected to the processing unit 21 through a serial 
port interface 46 that is coupled to the system bus, but may 
be connected by other interfaces, such as a parallel port, 
game port or a universal serial bus (USB). A monitor 47 or 
other type of display device is also connected to the system 
bus 23 via an interface, such as a video adapter 48. In 
addition to the monitor, personal computers typically 
include other peripheral output devices (not shown), such as 
speakers and printers. 

The computer 20 may operate in a networked environ- 
ment using logical connections to one or more remote 
computers, such as a remote computer 49. The remote 
computer 49 may be another personal computer, a server, a 
router, a network PC, a peer device or other common 
network node, and typically includes many or all of the 
elements described above relative to the computer 20, 
although only a memory storage device 50 has been illus- 
trated in FIG. 1. The logical connections depicted in FIG. 1 
include a local area network (LAN) 51 and a wide area 
network (WAN) 52. Such networking environments are 
commonplace in offices, enterprise-wide computer 
networks, intranets and the Internet. 

When used in a LAN networking environment, the com- 
puter 20 is connected to the local network 51 through a 
network interface or adapter 53. When used in a WAN 
networking environment, the computer 20 typically includes 
a modem 54 or other means for establishing communica- 
tions over the wide area network 52, such as the Internet. 
The modem 54, which may be internal or external, is 
connected to the system bus 23 via the serial port interface 
46. In a networked environment, program modules depicted 
relative to the computer 20, or portions thereof, may be 
stored in the remote memory storage device. It will be 
appreciated that the network connections shown are exem- 
plary and other means of establishing a communications link 
between the computers may be used. 

While the present invention has been described with a 
degree of particularity, it is the intent that the invention 
include all modifications and alterations from the disclosed 
implementations falling within the spirit or scope of the 
appended claims. 
We claim:

1. A method for evaluating data records contained within 
a database wherein each record has multiple data attributes; 
the method comprising the steps of: 

a) clustering the data records contained in the database
into multiple data clusters wherein each of the multiple
data clusters is characterized by a cluster model; and
b) building a new database of records having an augmented
record format that contains the original record
attributes and an additional record attribute containing
a cluster identifier for each record based on the clustering
step.

2. The method of claim 1 wherein the cluster model
includes a) a number of data records associated with that
cluster, b) centroids for each attribute of the cluster model
and c) a spread for each attribute of the cluster model.

3. The method of claim 1 additionally comprising the step 
of indexing the records in the database on the additional 
record attribute. 

4. The method of claim 1 additionally comprising the step
of finding a nearest neighbor of a query data record by 
evaluating database records indexed by means of the cluster 
identifiers found during the clustering step. 

5. The method of claim 4 wherein the step of finding the 
nearest neighbor is performed by evaluating a probability 
estimate based upon a cluster model for the clusters that is 
created during the clustering step. 

6. The method of claim 5 wherein the step of finding the
nearest neighbor is performed by scanning the database
records indexed by cluster identifiers having a greatest
probability of containing a nearest neighbor.

7. The method of claim 6 wherein the step of scanning
database records is performed for data records indexed on
multiple cluster identifiers so long as a probability of finding
a nearest neighbor within a cluster exceeds a threshold.

8. The method of claim 1 wherein the clustering process
is performed using a scalable clustering process wherein a
portion of the database is brought into a rapid access
memory prior to clustering and then a portion of the data
brought into the rapid access memory is summarized to
allow other data records to be brought into memory for
further clustering analysis.

9. The method of claim 1 wherein the number of data
attributes is greater than 10.

10. The method of claim 4 wherein the step of finding the
nearest neighbor is based upon quadratic distance metric
between the query data record and the database records
indexed by the cluster identifier.

11. A method for evaluating data records stored on a
storage medium wherein each record has multiple data
attributes that have been characterized by a probability
function found by clustering the data in the database to
produce a clustering model; said method comprising the
steps of assigning a cluster number for each of the records
on the storage medium based upon the probability function,
and finding a nearest neighbor from the data records for a
query record by scanning data records from more than one
based upon the probability function of the cluster model
used to assign cluster numbers to the data records.

12. The method of claim 11 wherein records having the
same index are written to a single file on the storage
medium.

13. The method of claim 11 wherein the records are stored 
by a database management system and the index is used to 
form a record attribute of records stored within a database
maintained on the storage medium. 

14. The method of claim 11 wherein the clustering model 
for the database defines a probability function that is a 
mixture of Gaussians.

15. The method of claim 14 wherein the assigning of a 
cluster number comprises a Bayes decision for assigning 
each data record to a data cluster associated with one of said 
Gaussians. 
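
The Bayes decision recited in claim 15 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each Gaussian has a diagonal covariance; the function names, weights, means, and variances below are hypothetical and do not come from the specification. A record is assigned to the cluster whose weighted density at that record is largest.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def assign_cluster(x, weights, means, variances):
    """Bayes decision: index of the cluster with the highest posterior."""
    scores = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical two-cluster mixture model (values chosen for illustration).
weights = [0.5, 0.5]
means = [(0.0, 0.0), (5.0, 5.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

records = [(0.2, -0.1), (4.8, 5.3)]
labels = [assign_cluster(r, weights, means, variances) for r in records]
```

The posterior is compared in log space because the normalizing constant is common to all clusters and can be dropped, and because log densities avoid underflow when the number of attributes is large.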

16. The method of claim 11 further comprising 
the step of subdividing data records within each cluster into 
cluster subcomponents and additionally comprising the step 
of finding a nearest neighbor of a query data record by scanning 
records from a cluster subcomponent. 

17. The method of claim 11 wherein the step of determining 
a nearest neighbor comprises the step of determining a 
distance between the query data record and data records 
accessed in the database by means of the database index. 

18. The method of claim 11 wherein a multiple number of 
data records are stored as a set of nearest neighbors. 

19. The method of claim 11 wherein the nearest neighbor 
determination is based on a distance determination between 
a query record and a data record that comprises a quadratic 
normal form of distance. 
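
A quadratic form distance of the kind referred to in claim 19 can be sketched as follows. The sketch assumes the distance between a query record q and a data record x takes the form (q - x)^T A (q - x) for a symmetric positive semidefinite matrix A; the function name and the example matrices are hypothetical. With A equal to the identity matrix, the value reduces to the squared Euclidean distance.

```python
def quadratic_distance(q, x, A):
    """Quadratic form (q - x)^T A (q - x) for a square matrix A."""
    diff = [qi - xi for qi, xi in zip(q, x)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]
weighted = [[2.0, 0.0], [0.0, 1.0]]  # hypothetical attribute weighting

d_euclid_sq = quadratic_distance((0.0, 0.0), (3.0, 4.0), identity)
d_weighted = quadratic_distance((0.0, 0.0), (3.0, 4.0), weighted)
```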

20. The method of claim 11 wherein the storage medium 
stores a database of data records and further comprises a step 
of finding a nearest neighbor to a query record from the data 
records of the database, said step of finding the nearest 
neighbor comprising the step of choosing between a sequen- 
tial scan of the database to find the nearest neighbor or 
searching for the nearest neighbor using the index derived 
from the probability function thereby optimizing the step of 
finding the nearest neighbor. 

21. A process for use in answering queries comprising the 
steps of: 

a) clustering data stored in a database using a clustering 
technique to provide an estimate of the probability 
density function of the sample data; 

b) adding an additional column attribute to the database 
that represents the predicted cluster membership for 
each data record within the database; and 

c) rebuilding a data table of the database using the newly 
added column as an index to records in the table. 
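
Steps b) and c) of claim 21 can be illustrated with a short sketch using Python's standard `sqlite3` module. The table name, column names, and the stand-in assignment function are hypothetical. The sketch creates a secondary index on the added column rather than physically rebuilding the table, which is a simplification of step c).

```python
import sqlite3

def add_cluster_index(conn, table, assign):
    """Add a cluster_id column, fill it per record, and index it."""
    conn.execute(f"ALTER TABLE {table} ADD COLUMN cluster_id INTEGER")
    rows = conn.execute(f"SELECT rowid, x, y FROM {table}").fetchall()
    conn.executemany(
        f"UPDATE {table} SET cluster_id = ? WHERE rowid = ?",
        [(assign((x, y)), rowid) for rowid, x, y in rows],
    )
    conn.execute(f"CREATE INDEX idx_{table}_cluster ON {table}(cluster_id)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(0.1, 0.2), (5.1, 4.9), (0.3, -0.2)],
)

# Hypothetical stand-in for the cluster assignment produced by the model.
add_cluster_index(conn, "records", lambda p: 0 if p[0] + p[1] < 5 else 1)

in_cluster_0 = conn.execute(
    "SELECT COUNT(*) FROM records WHERE cluster_id = 0"
).fetchone()[0]
```

Once the column is indexed, a query restricted to one cluster (`WHERE cluster_id = ?`) touches only the records assigned to that cluster instead of the whole table.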

22. The process of claim 21 wherein an index for the data 
in the database is created on the additional column. 

23. The process of claim 21 comprising the additional step 
of performing a nearest neighbor query to identify a nearest 
neighbor data point to a query data point. 

24. The process of claim 22 wherein the nearest neighbor 
query is performed by finding a nearest cluster to the query 
data point. 

25. The process of claim 24 additionally comprising the 
step of scanning data in a cluster identified as most likely to 
contain nearest neighbor based on a probability estimate for 
said cluster. 

26. The process of claim 25 wherein if the probability that 
the nearest neighbor estimate is correct is above a certain 
threshold, the scanning is stopped, but if it is not, then 
scanning additional clusters to find the nearest neighbor. 
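
The scan-and-stop behavior of claims 25 and 26 can be sketched as below. The stopping rule used here (stop once the posterior probability mass of the scanned clusters reaches a threshold) is a simplified stand-in chosen for illustration and is not the probability estimate described in the specification; all names and data are hypothetical.

```python
import math

def scan_until_confident(query, clusters, posteriors, threshold=0.9):
    """Scan clusters in decreasing posterior order; stop once confident.

    clusters:   dict cluster_id -> list of records (tuples of floats)
    posteriors: dict cluster_id -> P(cluster | query), assumed to sum to 1
    Stopping rule: posterior mass of scanned clusters >= threshold
    (a simplified stand-in for the model-based estimate).
    """
    best, best_dist, covered, scanned = None, math.inf, 0.0, []
    for cid in sorted(posteriors, key=posteriors.get, reverse=True):
        for rec in clusters[cid]:
            d = math.dist(query, rec)
            if d < best_dist:
                best, best_dist = rec, d
        scanned.append(cid)
        covered += posteriors[cid]
        if covered >= threshold:
            break  # confident enough; skip the remaining clusters
    return best, scanned

clusters = {0: [(0.0, 0.0), (1.0, 1.0)], 1: [(5.0, 5.0)], 2: [(9.0, 9.0)]}
posteriors = {0: 0.7, 1: 0.25, 2: 0.05}
best, scanned = scan_until_confident((0.9, 0.8), clusters, posteriors)
```

In this example the most likely cluster is scanned first, a second cluster is scanned because the threshold has not yet been reached, and the least likely cluster is never read.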

27. In a computer data mining system, apparatus for 
evaluating data in a database comprising: 

a) one or more data storage devices for storing data 
records on a storage medium; the data records includ- 
ing data attributes; and 
b) a computer having a rapid access memory and an 
interface to the storage devices for reading data from 
the storage medium and bringing the data records from 
the storage medium into the rapid access memory for 
subsequent evaluation; 

c) the computer comprising a processing unit for evalu- 
ating at least some of the data records and for deter- 
mining a probability density function for the records 
based on a clustering of data from data in the database 
into multiple numbers of data clusters, and said com- 
puter programmed to build an index for the data records 
in the database based on the probability density 
function, wherein said computer performs an approxi- 
mate nearest neighbor analysis by choosing a specified 
cluster based on the index and evaluating records of the 
specified cluster for nearness to a given data record. 

28. The apparatus of claim 27 wherein said computer 
stores data records having a common index based on cluster 
number in a file of records not part of a database table. 

29. The apparatus of claim 27 wherein the computer 
includes a database management component for setting up a 
database and using the index to organize data records in the 
database. 

30. The apparatus of claim 29 wherein the computer 
builds an additional database of records for storage on the 
one or more data storage devices and wherein the data 
records of the additional database are augmented with a 
cluster attribute. 

31. A computer-readable medium having computer- 
executable components comprising: 

a) a database component for interfacing with a database 
that stores data records made up of multiple data 
attributes; 

b) a modeling component for constructing and storing a 
clustering model that characterizes multiple data clus- 
ters wherein the modeling component constructs a 
model of data clustering that corresponds to a mixture 
of probability functions; and 

c) an analysis component for indexing the database on a 
cluster number for each record in the database wherein 
the indexing is performed based on a probability 
assessment of each record to the mixture of probability 
functions and for approximating a nearest neighbor 
query by determining an index for a sample record and 
scanning data records previously assigned a similar 
index. 

32. The computer readable medium of claim 31 wherein 
said indexing component generates an augmented data 
record having a cluster number attribute for storage by the 
database component. 

33. The computer readable medium of claim 31 wherein 
said modeling component is adapted to compare a new 
model to a previously constructed model to evaluate whether 
further of said data records should be moved from said 
database into said rapid access memory for modeling. 

34. The computer readable medium of claim 31 wherein 
said modeling component is adapted to update said cluster 
model by calculating a weighted contribution by each of said 
data records in said rapid access memory. 

35. A method for evaluating data records stored on a 
storage medium wherein each record has multiple data 
attributes that have been clustered to define a probability 
function of the data records stored on the storage medium; 
said method comprising the steps of evaluating the clusters 
of a clustering model and if the cluster separation between 
cluster centroids is of a sufficient size, assigning an index for 
each of the records on the storage medium based upon the 
probability function that is derived from the clustering 
model. 

36. A method for evaluating data records stored on a 
storage medium wherein each record has multiple data 
attributes that have been characterized by a probability 
function found by clustering the data in the database to 
produce a clustering model; said method comprising the 
steps of assigning a cluster number for each of the records 
on the storage medium based upon the probability function, 
and finding a nearest neighbor from the data records for a 
query record by choosing between a scan of a subset of the 
database and a complete scan of the database based on the 
probability estimate generated by a statistical model of the 
data. 

37. A method for performing an approximate nearest 
neighbor search of data records in a database stored on a 
storage medium wherein each record has multiple data 
attributes comprising: 
a) clustering the data records to define a cluster model 
having multiple clusters made up of probability func- 
tions to form a compressed characterization of the data 
records of the database; 

b) assigning cluster numbers as indexes to the data 
records based on the probability functions; and 

c) searching data records from a cluster to find a nearest 
neighbor within said cluster of a sample data record 
based on a nearness of the sample data record to the 
clusters that make up the cluster model. 

38. The method of claim 37 wherein records from multiple 
numbers of clusters are searched to find a nearest 
neighbor to a sample data record. 

39. The method of claim 38 wherein the number of 
clusters searched is based on a probability that an actual 
nearest neighbor is contained within an as yet unscanned 
cluster. 